Abstract

A graph $G'(V, E')$ is an $\epsilon$-sparsification of $G$ for some $\epsilon > 0$ if every (weighted) cut in $G'$ is within
$(1 \pm \epsilon)$ of the corresponding cut in $G$. A celebrated result of Benczur and Karger shows that for every
undirected graph $G$, an $\epsilon$-sparsification with $O(n \log n/\epsilon^2)$ edges can be constructed in $O(m \log^2 n)$ time.
The notion of cut-preserving graph sparsification has played an important role in speeding up algorithms
for several fundamental network design and routing problems. Applications to modern massive data sets
often constrain algorithms to use computation models that restrict random access to the input. The
semi-streaming model, in which the algorithm is constrained to use $\tilde{O}(n)$ space, has been shown to be a good
abstraction for analyzing graph algorithms in applications to large data sets. Recently, a semi-streaming
algorithm for graph sparsification was presented by Ahn and Guha; the total running time of their
implementation is $\Omega(mn)$, too large for applications where both space and time are important. In this paper, we
introduce a new technique for graph sparsification, namely refinement sampling, that gives an $\tilde{O}(m)$ time
semi-streaming algorithm for graph sparsification.

Specifically, we show that refinement sampling can be used to design a one-pass streaming algorithm
for sparsification that takes $O(\log\log n)$ time per edge, uses $O(\log^2 n)$ space per node, and outputs an
$\epsilon$-sparsifier with $O(n \log^3 n/\epsilon^2)$ edges. At a slightly increased space and time complexity, we can reduce
the sparsifier size to $O(n \log n/\epsilon^2)$ edges, matching the Benczur-Karger result, while improving upon the
Benczur-Karger runtime for $m = \omega(n \log^3 n)$. Finally, we show that an $\epsilon$-sparsifier with $O(n \log n/\epsilon^2)$
edges can be constructed in two passes over the data and $O(m)$ time whenever $m = \Omega(n^{1+\delta})$ for some
constant $\delta > 0$. As a by-product of our approach, we also obtain an $O(m \log\log n + n \log n)$ time
streaming algorithm to compute a sparse $k$-connectivity certificate of a graph.
1 Introduction

The notion of graph sparsification was introduced in [BK96], where the authors gave a near-linear time
procedure that takes as input an undirected graph $G$ on $n$ vertices and constructs a weighted subgraph $H$ of $G$
with $O(n \log n/\epsilon^2)$ edges such that the value of every cut in $H$ is within a $1 \pm \epsilon$ factor of the value of the
corresponding cut in $G$. This algorithm has subsequently been used to speed up algorithms for finding
approximately minimum or sparsest cuts in graphs ([BK96, KRV06]), as well as in a host of other applications (e.g.,
[KL02]). A more general class of spectral sparsifiers was recently introduced by Spielman and Srivastava in
[SS08]. The algorithms developed in [BK96] and [SS08] take near-linear time in the size of the graph and
produce very high quality sparsifiers, but require random access to the edges of the input graph $G$, which is
often prohibitively expensive in applications to modern massive data sets. The streaming model of computation,
which restricts algorithms to use a small number of passes over the input and space polylogarithmic in the
size of the input, has been studied extensively in various application domains (e.g., [Mut06]), but has proven
too restrictive for even the simplest graph algorithms (even testing $s$-$t$ connectivity requires $\Omega(n)$ space).
The less restrictive semi-streaming model, in which the algorithm is restricted to use $\tilde{O}(n)$ space, is more
suited for graph algorithms [FKM+05]. The problem of constructing graph sparsifiers in the semi-streaming
model was recently posed by Ahn and Guha [AG09], who gave a one-pass algorithm for finding
Benczur-Karger type sparsifiers with a slightly larger number of edges than the original Benczur-Karger algorithm, i.e.
$O(n \log n \log(m/n)/\epsilon^2)$ as opposed to $O(n \log n/\epsilon^2)$. Their algorithm requires only one pass over the data, and
their analysis is quite non-trivial. However, its time complexity is $\Omega(mn\,\mathrm{polylog}(n))$, making it impractical
for applications where both time and space are important constraints.¹

Apart from the issue of random access vs disk, the semi-streaming model is also important for scenarios 
where edges of the graph are revealed one at a time by an external process. For example, this application 
maps well to online social networks where edges arrive one by one, but efficient network computations may 
be required at any time, making it particularly useful to have a dynamically maintained sparsifier. 



Our results: We introduce the concept of refinement sampling. At a high level, the basic idea is to sample
edges at geometrically decreasing rates, using the sampled edges at each rate to refine the connected
components from the previous rate. The sampling rate at which the two endpoints of an edge get separated into
different connected components is used as an approximate measure of the "strength" of that edge. We use
refinement sampling to obtain two algorithms for computing Benczur-Karger type sparsifiers of undirected
graphs in the semi-streaming model efficiently. The first algorithm requires $O(\log n)$ passes, $O(\log n)$ space
per node, $O(\log n \log\log n)$ work per edge and produces sparsifiers with $O(n \log^2 n/\epsilon^2)$ edges. The second
algorithm requires one pass over the edges of the graph, $O(\log^2 n)$ space per node, $O(\log\log n)$ work per
edge and produces sparsifiers with $O(n \log^3 n/\epsilon^2)$ edges. Several properties of these results are worth noting:

1. In the incremental model, the amortized running time per edge arrival is $O(\log\log n)$, which is quite
practical and much better than the previously best known running time of $O(n)$.

2. The sample size can be improved for both algorithms by running the original Benczur-Karger
algorithm on the sampled graph without violating the restrictions of the semi-streaming model, yielding
$O(\log n \log\log n + (n/m) \log^4 n)$ and $O(\log\log n + (n/m) \log^5 n)$ amortized work per edge respectively.

3. Somewhat surprisingly, this two-stage (but still semi-streaming) algorithm improves upon the runtime
of the original sparsification scheme when $m = \omega(n \log^2 n)$ for the $O(\log n)$-pass version and
$m = \omega(n \log^3 n)$ for the one-pass version.



¹ As is often the case for semi-streaming algorithms, Ahn and Guha do not explicitly compute the running time of their algorithm;
$\Omega(mn\,\mathrm{polylog}(n))$ is the best running time we can come up with for their algorithm.
4. As a by-product of our analysis, we show that refinement sampling can be regarded as a one-pass
algorithm for producing a sparse connectivity certificate of a weighted undirected graph (see Corollary
4.7). Thus we obtain a streaming analog of the Nagamochi-Ibaraki result [NI92] for producing
sparse certificates, which is in turn used in the Benczur-Karger sampling.

Finally, in Section 5 we give an algorithm for constructing $O(n \log n/\epsilon^2)$-size sparsifiers in $O(m)$ time
using two passes over the input when $m = \Omega(n^{1+\delta})$.

Related Work: In [AG09] the authors give an algorithm for sparsification in the semi-streaming model
based on the observation that one can use the constructed sparsification of the currently received part of
the graph to estimate the strong connectivity of a newly received edge. A brief outline of the algorithm
is as follows. Denote the edges of $G$ in their order in the stream by $e_1, \ldots, e_m$. Set $H_0 = (V, \emptyset)$. For
every $t > 0$ compute the strength $s_t$ of $e_t$ in $H_{t-1}$, and with probability $p_{e_t} = \min\{\rho/s_t, 1\}$ set
$H_t = (V, E(H_{t-1}) \cup \{e_t\})$, giving $e_t$ weight $1/p_{e_t}$ in $H_t$, and set $H_t = H_{t-1}$ otherwise. For every $t$ the graph $H_t$
is an $\epsilon$-sparsification of the subgraph received by time $t$. The authors show that this algorithm yields an
$\epsilon$-sparsifier with $O(n \log n \log(m/n)/\epsilon^2)$ edges. However, it is unclear how one can calculate the strengths $s_t$
efficiently. A naive implementation would take $\Omega(n)$ time for each $t$, resulting in $\Omega(mn)$ time overall. One
could conceivably use the fact that $H_{t-1}$ is always a subgraph of $H_t$, but to the best of our knowledge there
are no results on efficiently calculating or approximating strong connectivities in the incremental model.

It is important to emphasize that our techniques for obtaining an efficient one-pass sparsification algorithm
are very different from the approach of [AG09]. In particular, the structure of dependencies in the sampling
process is quite different. In the algorithm of [AG09] edges are not sampled independently, since the
probability with which an edge is sampled depends on the coin tosses for edges that came earlier in the stream. Our
approach, on the other hand, decouples the process of estimating edge strengths from the process of producing
the output sample, thus simplifying analysis and making a direct invocation of the Benczur-Karger sampling
theorem possible.

Organization: Section 2 introduces some notation and reviews the Benczur-Karger sampling
algorithm. We then introduce in Section 3 our refinement sampling scheme, and show how it can be used to
obtain a sparsification algorithm requiring $O(\log n)$ passes and $O(\log n \log\log n)$ work per edge. The size
of the sampled graph is $O(n \log^2 n/\epsilon^2)$, i.e. at most $O(\log n)$ times larger than that produced by
Benczur-Karger sampling. Finally, in Section 4 we build on the ideas of Section 3 to obtain a one-pass algorithm with
$O(\log\log n)$ work per edge at the expense of increasing the size of the sample to $O(n \log^3 n/\epsilon^2)$.

2 Preliminaries 

We will denote by $G(V, E)$ the input undirected graph with vertex set $V$ and edge set $E$ with $|V| = n$
and $|E| = m$. For any $\epsilon > 0$, we say that a weighted graph $G'(V, E')$ is an $\epsilon$-sparsification of $G$ if every
(weighted) cut in $G'$ is within $(1 \pm \epsilon)$ of the corresponding cut in $G$. Given any two collections of sets that
partition $V$, say $S_1$ and $S_2$, we say that $S_2$ is a refinement of $S_1$ if for any $X \in S_1$ and $Y \in S_2$, either
$X \cap Y = \emptyset$ or $Y \subseteq X$. In other words, $S_1 \cup S_2$ form a laminar set system.
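
The refinement relation can be checked mechanically. The following is a minimal Python sketch, not part of the original presentation; the function name `is_refinement` and the representation of partitions as lists of sets are illustrative assumptions.

```python
def is_refinement(coarse, fine):
    """Return True if partition `fine` is a refinement of partition `coarse`.

    Both arguments are lists of non-empty sets partitioning the same ground set.
    """
    # Map every element to the index of the coarse block that contains it.
    block_of = {}
    for i, block in enumerate(coarse):
        for v in block:
            block_of[v] = i
    # Each fine block Y must lie inside a single coarse block X (Y subset of X);
    # it is then automatically disjoint from every other coarse block.
    return all(len({block_of[v] for v in block}) == 1 for block in fine)


S1 = [{1, 2, 3}, {4, 5}]
S2 = [{1, 2}, {3}, {4, 5}]
print(is_refinement(S1, S2))  # True: S2 refines S1
print(is_refinement(S2, S1))  # False: {1, 2, 3} straddles two blocks of S2
```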

2.1 Benczur-Karger Sampling Scheme 

We say that a graph is k-connected if the value of each cut in G is at least k. The Benczur-Karger sampling 
scheme uses a more strict notion of connectivity, referred to as strong connectivity, defined as follows: 

Definition 2.1 [BK96] A $k$-strong component is a maximal $k$-connected vertex-induced subgraph. The strong
connectivity of an edge $e$, denoted by $s_e$, is the largest $k$ such that a $k$-strong component contains $e$.
Note that the set of $k$-strong components forms a partition of the vertex set of $G$, and the set of $(k+1)$-strong
components forms a refinement of this partition. We say $e$ is $k$-strong if its strong connectivity is $k$ or more, and
$k$-weak otherwise. The following simple lemma will be useful in our analysis.

Lemma 2.2 [BK96] The number of $k$-weak edges in a graph on $n$ vertices is bounded by $k(n - 1)$.

The sampling algorithm relies on the following result: 

Theorem 2.3 [BK96] Let $G'$ be obtained by sampling edges of $G$ with probability $p_e = \min\{\rho/(\epsilon^2 s_e), 1\}$, where
$\rho = 16(d + 2) \ln n$, and giving each sampled edge weight $1/p_e$. Then $G'$ is an $\epsilon$-sparsification of $G$ with
probability at least $1 - n^{-d}$. Moreover, the expected number of edges in $G'$ is $O(n \log n)$.

It follows easily from the proof of Theorem 2.3 in [BK96] that if we sample using an underestimate of
edge strengths, the resulting graph is still an $\epsilon$-sparsification.

Corollary 2.4 Let $G'$ be obtained by sampling each edge of $G$ with probability $\tilde{p}_e \ge p_e$ and giving every
sampled edge $e$ weight $1/\tilde{p}_e$. Then $G'$ is an $\epsilon$-sparsification of $G$ with probability at least $1 - n^{-d}$.
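
The sampling step of Theorem 2.3 is simple once strengths (or underestimates of them, as in Corollary 2.4) are available. The sketch below is only an illustration in Python under the assumption that a dictionary `strength` of such values is given; the function name `bk_sample` and its signature are hypothetical, and computing the strengths themselves is the hard part that the rest of the paper addresses.

```python
import math
import random


def bk_sample(edges, strength, n, eps, d=1.0, rng=None):
    """Sample each edge e with p_e = min(rho / (eps^2 * s_e), 1), weight 1/p_e.

    `strength[e]` is s_e or an underestimate of it (an assumed input).
    Returns a list of (edge, weight) pairs.
    """
    rng = rng or random.Random()
    rho = 16 * (d + 2) * math.log(n)  # oversampling parameter of Theorem 2.3
    sample = []
    for e in edges:
        p = min(rho / (eps ** 2 * strength[e]), 1.0)
        if rng.random() < p:
            sample.append((e, 1.0 / p))
    return sample
```

Passing a smaller value for `strength[e]` only increases `p`, which is exactly the situation covered by Corollary 2.4.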

In [BK96] the authors give an $O(m \log^2 n)$ time algorithm for calculating estimates of strong
connectivities that are sufficient for sampling. The algorithm, however, requires random access to the edges of the
graph, which is disallowed in the semi-streaming model. More precisely, the procedure for estimating edge
strengths given in [BK96] relies on the Nagamochi-Ibaraki algorithm for obtaining sparse certificates for
edge-connectivity in $O(m)$ time ([NI92]). The algorithm of [NI92] relies on random access to edges of the graph
and to the best of our knowledge no streaming implementation is known. In fact we show in Corollary 4.7
that refinement sampling yields a streaming algorithm for producing sparse certificates for edge-connectivity
in one pass over the data.

In what follows we will consider unweighted graphs to simplify notation. The results obtained can be
easily extended to the polynomially weighted case as outlined in Remark 4.8 at the end of Section 4.

3 Refinement Sampling 

We start by introducing the idea of refinement sampling that gives a simple algorithm for efficiently computing 
a BK-sample, and serves as a building block for our streaming algorithms. 

To motivate refinement sampling, let us consider the simpler problem of identifying all edges of strength
at least $k$ in the input graph $G(V, E)$. A natural idea to do so is as follows: (a) generate a graph $G'$ by sampling
edges of $G$ with probability $O(1/k)$, (b) find connected components of $G'$, and (c) output all edges $(u, v) \in E$
such that $u$ and $v$ are in the same connected component in $G'$. The sampling rate of $O(1/k)$ suggests that if
an edge $(u, v)$ has strong connectivity below $k$, the vertices $u$ and $v$ would end up in different components in
$G'$, and conversely, if the strong connectivity of $(u, v)$ is above $k$, they are likely to stay connected and hence
be output in step (c). While this process indeed filters out most $k$-weak edges, it is easy to construct examples
where the output will contain many edges of strength 1 even though $k$ is polynomially large (a star graph, for
instance). The idea of refinement sampling is to get around this by successively refining the sample obtained
in the final step (c) above.

In designing our algorithm, we will repeatedly invoke the subroutine Refine (S,p) that essentially imple- 
ments the simple idea described above. 

Function: Refine (S, p) 

Input: Partition S of V, sampling probability p. 
Output: Partition $S'$ of $V$, a refinement of $S$.

1. Take a uniform sample $E'$ of edges of $E$ with probability $p$.

2. For each $U \in S$, $U \subseteq V$, let $C(U)$ be the set of connected components of $U$ induced by $E'$.

3. Return $S' := \bigcup_{U \in S} C(U)$.

It is easy to see that Refine can be implemented using $O(n)$ space, a total of $n$ Union operations with
$O(n \log n)$ overall cost and $m$ Find operations with $O(1)$ cost per operation, for an overall running time of
$O(n \log n + m)$ (see, e.g., [CLRS01]). Also, Refine can be implemented using a single pass over the set of
edges. A scheme of refinement relations between $S_{l,k}$ is given in Fig. 1.
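
The subroutine can be sketched in a few lines of Python. This is an illustration rather than the implementation analyzed above: the class `UnionFind` and the function `refine` are hypothetical names, and the union-find here uses path compression instead of the relabeling scheme that gives the exact $O(n \log n + m)$ accounting.

```python
import random


class UnionFind:
    """Disjoint sets over hashable items, with path compression."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # compress the path just walked
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)


def refine(partition, edges, p, rng):
    """Refine(S, p): one pass over `edges`, returning a refinement of `partition`."""
    block_of = {v: i for i, block in enumerate(partition) for v in block}
    uf = UnionFind(block_of)
    for u, v in edges:
        # Keep the edge with probability p; edges crossing blocks of S are ignored,
        # so components never merge across blocks.
        if rng.random() < p and block_of[u] == block_of[v]:
            uf.union(u, v)
    components = {}
    for v in block_of:
        components.setdefault(uf.find(v), set()).add(v)
    return list(components.values())
```

For example, `refine([{1, 2, 3, 4}], [(1, 2), (3, 4)], 1.0, random.Random(0))` splits the single block into `{1, 2}` and `{3, 4}`, since with `p = 1.0` every edge is kept.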

The refinement sampling algorithm computes partitions $S_{l,k}$ for $l = 1, \ldots, L$ and $k = 0, 1, \ldots, K$. Here
$L = \log(2n)$ is the number of strength levels (the factor of 2 is chosen for convenience to ensure that $S_{L,K}$
consists of isolated vertices whp), and $K$ is a parameter which we call the strengthening parameter. Also, we
choose a parameter $\phi > 0$, which we will refer to as the oversampling parameter. For a partition $S$, let $X(S)$
denote all the edges in $E$ which have endpoints in two different sets in $S$. The partitions are computed as
follows:

Algorithm 1 (Refinement Sampling)
Initialization: $S_{l,0} = \{V\}$ for $l = 1, \ldots, L$.

1. Set $k := 1$.

2. For each $l$, $1 \le l \le L$, set $S_{l,k} := \mathrm{Refine}(S_{l,k-1}, 2^{-l})$.

3. Set $k := k + 1$. If $k \le K$, go to step 2.

4. For each $e \in E$ define $L(e) = \min\{l : e \in X(S_{l,K})\}$. Sample edge $e$ with probability
$z(e) = \min\{1, \phi/(\epsilon^2 2^{L(e)})\}$
and assign it weight $1/z(e)$. Let $R(\phi, K)$ denote the set of edges sampled during this step; we call this
the refinement sample of $G$.
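
The whole procedure can be sketched compactly in Python. This is a hedged illustration, not the authors' code: the function names are hypothetical, the sampling probability is taken to be $z(e) = \min\{1, \phi/(\epsilon^2 2^{L(e)})\}$ (a reading of step 4 that is consistent with Lemmas 3.1 and 3.2), $L$ is rounded up to an integer, and an edge that is never separated (a low-probability event) is assigned level $L + 1$ as an assumption of this sketch.

```python
import math
import random


def components(vertices, edges):
    """Connected components of (vertices, edges) via a small union-find."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


def refine(partition, edges, p, rng):
    """Refine(S, p): split each block along a fresh p-sample of the edges."""
    block_of = {v: i for i, blk in enumerate(partition) for v in blk}
    kept = [(u, v) for u, v in edges
            if block_of[u] == block_of[v] and rng.random() < p]
    # Kept edges never cross blocks, so the global components refine `partition`.
    return components(list(block_of), kept)


def refinement_sample(vertices, edges, eps, phi, K, rng):
    """Sketch of Algorithm 1; returns a list of ((u, v), weight) pairs."""
    L = max(1, math.ceil(math.log2(2 * len(vertices))))
    S = {l: [set(vertices)] for l in range(1, L + 1)}   # S_{l,0} = {V}
    for _ in range(K):                                   # passes k = 1..K
        for l in range(1, L + 1):
            S[l] = refine(S[l], edges, 2.0 ** -l, rng)
    block = {l: {v: i for i, blk in enumerate(S[l]) for v in blk} for l in S}
    sample = []
    for u, v in edges:
        # L(e): first level whose final partition separates u and v.
        # Assumption of this sketch: fall back to L + 1 if no level does.
        level = next((l for l in range(1, L + 1)
                      if block[l][u] != block[l][v]), L + 1)
        z = min(1.0, phi / (eps ** 2 * 2 ** level))
        if rng.random() < z:
            sample.append(((u, v), 1.0 / z))
    return sample
```

With a very large `phi` every edge is kept with weight 1, which is a convenient sanity check; realistic runs use `phi` proportional to $\log n$.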

The following two lemmas relate the probabilities $z(e)$ to the sampling probabilities used in the
Benczur-Karger sampling scheme.

Lemma 3.1 For any $K > 0$, with probability at least $1 - Kn^{-d}$ every edge $e$ satisfies $z(e) \le 4\phi\rho/(\epsilon^2 s_e)$.

Proof: Consider an edge $e$ with strong connectivity $s_e$, and let $C$ denote the $s_e$-strongly connected component
containing $e$. By Theorem 2.3, sampling with probability $\min\{4\rho/s_e, 1\}$ preserves all cuts up to $1 \pm \frac{1}{2}$ in $C$
with probability at least $1 - n^{-d}$. Hence, all $s_e$-strongly connected components stay connected after $K$ passes
of Refine for all $l > 0$ such that $2^{-l} \ge 4\rho/s_e$, yielding the lemma. ■

Lemma 3.2 If $K > \log_{4/3} n$, then $2^{-L(e)+1} \ge 1/(2s_e)$ for every $e \in E(G)$ with probability at least
$1 - Ke^{-(n-1)/100}$.

Proof: Consider a level $l$ such that $p = 2^{-l} < 1/(2s_e)$. Let $H$ be the graph obtained by contracting all
$(s_e + 1)$-strong components in $G$ into supernodes. Since $H$ contains only $(s_e + 1)$-weak edges, the number
of edges is at most $s_e(n - 1)$ by Lemma 2.2. As the expected number of $(s_e + 1)$-weak edges in the sample
is at most $(n - 1)/2$, by Chernoff bounds, the probability that the number of $(s_e + 1)$-weak edges in the
sample exceeds $3(n - 1)/4$ is at most $(e^{1/4}(5/4)^{-5/4})^{(n-1)/2} < e^{-(n-1)/100}$. Thus at least one quarter
of the supernodes get isolated in each iteration. Hence, no $(s_e + 1)$-weak edge survives after $K = \log_{4/3} n$
rounds of refinement sampling with probability at least $1 - Ke^{-(n-1)/100}$. Since $L(e)$ was defined as the least
$l$ such that $e \in X(S_{l,K})$, the endpoints of $e$ were connected in $S_{L(e)-1,K}$, so $2^{-L(e)+1} \ge 1/(2s_e)$. ■
Figure 1: Scheme of refinement relations between partitions for Algorithm 1.



Theorem 3.3 Let $G'$ be the graph obtained by running Algorithm 1 with $\phi := 4\rho$. Then $G'$ has $O(n \log^2 n/\epsilon^2)$
edges in expectation, and is an $\epsilon$-sparsification of $G$ with probability at least $1 - n^{-d+1}$.

Proof: We have from Lemma 3.2 and the choice of $\phi$ that the sampling probabilities dominate those used
in Benczur-Karger sampling with probability at least $1 - Ke^{-(n-1)/100}$. Hence, by Corollary 2.4 we have
that every cut in $G'$ is within $1 \pm \epsilon$ of its value in $G$ with probability at least $1 - Ke^{-(n-1)/100} - n^{-d}$. The
expected size of the sample is $O(n \log^2 n/\epsilon^2)$ by Lemma 3.1 together with the fact that $\rho = O(\log n)$. The
probability of failure of the estimate in Lemma 3.1 is at most $Kn^{-d}$, so all bounds hold with probability at
least $1 - Kn^{-d} - Ke^{-(n-1)/100} - n^{-d} \ge 1 - n^{-d+1}$ for sufficiently large $n$. The high probability bound on
the number of edges follows by an application of the Chernoff bound. ■
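
To see concretely why the choice $\phi = 4\rho$ suffices, the two lemmas can be chained as in the following worked restatement. It assumes the sampling probability of Algorithm 1 has the form $z(e) = \min\{1, \phi/(\epsilon^2 2^{L(e)})\}$, and it uses the standard fact from [BK96] that $\sum_e 1/s_e \le n - 1$ in an unweighted graph.

```latex
% Domination (from Lemma 3.2): 2^{-L(e)+1} >= 1/(2 s_e), i.e. 2^{-L(e)} >= 1/(4 s_e).
\[
z(e) = \min\Big\{1, \frac{\phi}{\epsilon^2 2^{L(e)}}\Big\}
     \;\ge\; \min\Big\{1, \frac{\phi}{4\epsilon^2 s_e}\Big\}
     \;=\; \min\Big\{1, \frac{\rho}{\epsilon^2 s_e}\Big\} \;=\; p_e
     \qquad (\phi = 4\rho).
\]
% Expected size (from Lemma 3.1): z(e) <= 4 \phi \rho / (\epsilon^2 s_e).
\[
\mathbf{E}\big[|E(G')|\big] = \sum_{e \in E} z(e)
     \;\le\; \frac{4\phi\rho}{\epsilon^2} \sum_{e \in E} \frac{1}{s_e}
     \;\le\; \frac{16\rho^2 (n-1)}{\epsilon^2}
     \;=\; O\big(n \log^2 n/\epsilon^2\big),
\]
% since \rho = 16(d+2)\ln n = O(\log n).
```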

The next lemma follows from the discussion above: 



Lemma 3.4 For any $\epsilon > 0$, an $\epsilon$-sparsification of $G$ with $O(n \log^2 n/\epsilon^2)$ edges can be constructed in
$O(\log n)$ passes of Refine using $O(\log n)$ space per node and $O(\log^2 n)$ time per edge.



We now note that one $\log n$ factor in the running time comes from the fact that during each pass $k$
Algorithm 1 flips a coin at every level $l$ to decide whether or not to include $e$ into $S_{l,k}$ when $e \in S_{l,k-1}$. If we could
guarantee that $S_{l,k}$ is a refinement of $S_{l',k}$ for all $l' \le l$ and for all $k$, we would be able to use binary search
to find the largest $l$ such that $e \in S_{l,k}$ in $O(\log\log n)$ time. Algorithm 2 given below uses iterative sampling
to ensure a scheme of refinement relations given in Fig. 2. For each edge $e$, $1 \le k \le K$, and $1 \le l \le L$,
we define for convenience independent Bernoulli random variables $A_{l,k,e}$ such that $\Pr[A_{l,k,e} = 1] = 1/2$, even
though the algorithm will not always need to flip all these $O(\log^2 n)$ coins. Also define $U_{l,k,e} = \prod_{j \le l} A_{j,k,e}$.
The algorithm uses connectivity data structures $D_{l,k}$, $1 \le l \le L$, $1 \le k \le K$. Adding an edge $e$ to $D_{l,k}$
merges the components that the endpoints of $e$ belong to in $D_{l,k}$.
Algorithm 2 (An $O(\log n)$-Pass Sparsifier)

Input: Edges of $G$ streamed in adversarial order: $(e_1, \ldots, e_m)$.
Output: A sparsification $G'$ of $G$.
Initialization: Set $E' := \emptyset$.

1. For all $k = 1, \ldots, K$

2. Set $t = 1$.

3. For all $l = 1, \ldots, L$

4. Add $e_t = (u_t, v_t)$ to $D_{l,k}$ if $U_{l,k,e_t} = 1$ and $u_t$ and $v_t$ are connected in $D_{l,k-1}$.

5. Set $t := t + 1$. Go to step 3 if $t \le m$.

6. For each $e_t$ define $L'(e_t)$ as the minimum $l$ such that $u_t$ and $v_t$ are not connected in $D_{l,K}$. Set
$z'(e_t) := \min\{1, \phi/(\epsilon^2 2^{L'(e_t)})\}$. Output $e_t$ with probability $z'(e_t)$, giving it weight $1/z'(e_t)$.

Theorem 3.5 For any $\epsilon > 0$, there exists an $O(\log n)$-pass streaming algorithm that produces an $\epsilon$-sparsification
$G'$ of a graph $G$ with at most $O(n \log^2 n/\epsilon^2)$ edges using $O((n/m) \log n + \log n \log\log n)$ time per edge.

Proof: The correctness of Algorithm 2 follows in the same way as for Algorithm 1 above, so it remains
to determine its runtime. An $O((n/m) \log n + 1)$ term per edge comes from the amortized $O(n \log n + m)$
complexity of UNION-FIND operations. The $\log n$ factor in the runtime comes from the $\log n$ passes, and
we now show that step 3 can be implemented in $O(\log\log n)$ time. First note that since $S_{l',k'}$ is a refinement
of $S_{l,k}$ whenever $l' \ge l$ and $k' \ge k$, one can use binary search to determine the largest $l_0$ such that $u_t$ and $v_t$
are connected in $D_{l_0,k-1}$. One then keeps flipping a fair coin and adding $e$ to connectivity data structures
$D_{l,k}$ for successive $l > l_0$ as long as the coin keeps coming up heads. Since 2 such steps are performed on
average, it takes $O(K) = O(\log n)$ amortized time per edge by the Chernoff bound. Putting these estimates
together, we obtain the claimed time complexity. ■
The scheme of refinement relations between $S_{l,k}$ is depicted in Fig. 2.

Corollary 3.6 For any $\epsilon > 0$, there is an $O(\log n)$-pass algorithm that produces an $\epsilon$-sparsification $G'$
of an input graph $G$ with at most $O(n \log n/\epsilon^2)$ edges using $O(\log^2 n)$ space per node, and performing
$O(\log n \log\log n + (n/m) \log^4 n)$ amortized work per edge.

Proof: One can obtain a sparsification $G'$ with $O(n \log^2 n/\epsilon^2)$ edges by running Algorithm 2 on the input
graph $G$, and then run the Benczur-Karger algorithm on $G'$ without violating the restrictions of the
semi-streaming model. Note that even though $G'$ is a weighted graph, this will have overhead $O(\log^2 n)$ per edge
of $G'$ since the weights are polynomial. Since $G'$ has $O(n \log^2 n)$ edges, the amortized work per edge of
$G$ is $O(\log n \log\log n + (n/m) \log^4 n)$. The Benczur-Karger algorithm can be implemented using space
proportional to the size of the graph, which yields $O(\log^2 n)$ space per node. ■

Remark 3.7 The algorithm improves upon the runtime of the Benczur-Karger sparsification scheme when
$m = \omega(n \log^2 n)$.






4 A One-pass $\tilde{O}(n + m)$-Time Algorithm for Graph Sparsification



In this section we convert Algorithm 2 obtained in the previous section to a one-pass algorithm. We will
design a one-pass algorithm that produces an $\epsilon$-sparsifier with $O(n \log^3 n/\epsilon^2)$ edges using only $O(\log\log n)$
amortized work per edge. A simple post-processing step at the end of the algorithm will allow us to reduce the
size to $O(n \log n/\epsilon^2)$ edges with a slightly increased space and time complexity. The main difficulty is that
in going from $O(\log n)$ passes to a one-pass algorithm, we need to introduce and analyze new dependencies
in the sampling process.

As before, the algorithm maintains connectivity data structures $D_{l,k}$, where $1 \le l \le L$ and $1 \le k \le K$.
In addition to indexing $D_{l,k}$ by pairs $(l, k)$ we shall also write $D_J$ for $D_{l,k}$, where $J = K(l - 1) + k$, so that
$1 \le J \le LK$. This induces a natural ordering on $D_{l,k}$, illustrated in Fig. 3, that corresponds to the structure
of refinement relations. We will assume for simplicity of presentation that $D_0 = D_{1,0}$ is a connectivity data
structure in which all vertices are connected. For each edge $e$, $1 \le l \le L$, and $1 \le k \le K$, we define an
independent Bernoulli random variable $A'_{l,k,e}$ with $\Pr[A'_{l,k,e} = 1] = 2^{-l}$. The algorithm is as follows:

Algorithm 3 (A One-Pass Sparsifier)

Input: Edges of $G$ streamed in adversarial order: $(e_1, \ldots, e_m)$.
Output: A sparsification $G'$ of $G$.
Initialization: Set $E' := \emptyset$.

1. Set $t = 1$.

2. For all $J = 1, \ldots, LK$ ($J = (l, k)$)

3. Add $e_t = (u_t, v_t)$ to $D_J$ if $A'_{l,k,e_t} = 1$ and $u_t$ and $v_t$ are connected in $D_{J-1}$.

4. Define $L'(e_t)$ as the minimum $l$ such that $u_t$ and $v_t$ are not connected in $D_{l,K}$. Set
$z'(e_t) := \min\{1, \phi/(\epsilon^2 2^{L'(e_t)})\}$.
Output $e_t$ with probability $z'(e_t)$, giving it weight $1/z'(e_t)$.

5. Set $t := t + 1$. Go to step 2 if $t \le m$.
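
The control flow of the one-pass algorithm can be sketched in Python as follows. This is a deliberately naive illustration that spends $O(LK)$ work per edge instead of the $O(\log\log n)$ achieved by the binary-search implementation discussed later in this section. Several details are assumptions of the sketch rather than statements of the paper: the coin bias $2^{-l}$ for $D_{l,k}$ and the form of $z'$ are readings of the formulas above, vertices are assumed to be the integers $0, \ldots, n-1$, $L$ is rounded up to an integer, and an edge whose endpoints remain connected at every level is assigned level $L + 1$. The names `UnionFind` and `one_pass_sparsifier` are hypothetical.

```python
import math
import random


class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def connected(self, x, y):
        return self.find(x) == self.find(y)

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)


def one_pass_sparsifier(n, stream, eps, phi, K, rng):
    """Process each edge of `stream` once; return ((u, v), weight) pairs."""
    L = max(1, math.ceil(math.log2(2 * n)))
    # D[J] for J = 1..L*K with J = K*(l-1) + k. D[0] is a placeholder for D_0,
    # in which all vertices are connected.
    D = [None] + [UnionFind(n) for _ in range(L * K)]
    out = []
    for u, v in stream:
        for J in range(1, L * K + 1):
            l = (J - 1) // K + 1
            in_prev = J == 1 or D[J - 1].connected(u, v)
            # Coin with bias 2^{-l} (assumed); only flipped when it matters.
            if in_prev and rng.random() < 2.0 ** -l:
                D[J].union(u, v)
        # L'(e): least level l at which u, v are not connected in D_{l,K}.
        level = next((l for l in range(1, L + 1)
                      if not D[l * K].connected(u, v)), L + 1)
        z = min(1.0, phi / (eps ** 2 * 2 ** level))
        if rng.random() < z:
            out.append(((u, v), 1.0 / z))
    return out
```

Because an edge enters $D_J$ only if its endpoints are already connected in $D_{J-1}$, each $D_J$ stays a refinement of $D_{J-1}$, which is the property the faster implementation exploits.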

Informally, Algorithm 3 underestimates the strength of some edges until the data structures $D_{l,k}$ become
properly connected but proceeds similarly to Algorithms 1 and 2 after that. Our main goal in the rest of the
section is to show that this underestimation of strengths does not lead to a large increase in the size of the
sample.

Note that not all $LK = O(\log^2 n)$ coin tosses $A'_{l,k,e}$ per edge are necessary for an implementation of
Algorithm 3 (in particular, we will show that Algorithm 3 can be implemented with $O(\log\log n) = o(LK)$
work per edge). However, the random variables $A'_{l,k,e}$ are useful for analysis purposes. We now show that
Algorithm 3 outputs a sparsification $G'$ of $G$ with $O(n \log^3 n/\epsilon^2)$ edges whp.

Lemma 4.1 For any $\epsilon > 0$, w.h.p. the graph $G'$ is an $\epsilon$-sparsification of $G$.

Proof: We can couple behaviors of Algorithms 1 and 3 using the coin tosses $A'_{l,k,e}$ to show that $L(e) \ge L'(e)$
for every edge $e$, i.e. $z'(e) \ge z(e)$. Hence $G'$ is a sparsification of $G$ by Corollary 2.4. ■
It remains to upper bound the size of the sample. The following lemma is crucial to our analysis; its proof
is deferred to Appendix A due to space limitations.

Lemma 4.2 Let $G(V, E)$ be an undirected graph. Consider the execution of Algorithm 3, and for
$1 \le J \le LK$ where $J = (l, k)$, let $X^{(J)}$ denote the set of edges $e = (u, v)$ such that $u$ and $v$ are connected in $D_{J-1}$
when $e$ arrives. Then $|E \setminus X^{(J)}| = O(K 2^l n)$ with high probability.
Lemma 4.3 The number of edges in $G'$ is $O(n \log^3 n/\epsilon^2)$ with high probability.

Proof: Recall that Algorithm 3 samples an edge $e_t = (u_t, v_t)$ with probability $z'(e_t) = \min\{1, \phi/(\epsilon^2 2^{L'(e_t)})\}$,
where $L'(e_t)$ is the minimum $l$ such that $u_t$ and $v_t$ are not connected in $D_{l,K}$. As before, for $J = (l, k)$, we
denote by $X^{(J)}$ the set of edges $e = (u, v)$ such that $u$ and $v$ are connected in $D_{J-1}$ when $e$ arrives. Note that
$X^{(L+1,1)} = \emptyset$ w.h.p. by our choice of $L = \log(2n)$. For each $1 \le l \le L$, let $Y_l = X^{(l,1)} \setminus X^{(l+1,1)}$.
We have by Lemma 4.2 that $\sum_{1 \le j \le l} |Y_j| = O(K 2^l n)$ w.h.p. Also note that edges in $Y_l$ are sampled with
probability at most $\phi/(\epsilon^2 2^{l-1})$. Hence, we get that the expected number of edges in the sample is at most

$$\sum_{l=1}^{L} |Y_l| \cdot \frac{\phi}{\epsilon^2 2^{l-1}} = O\left(\sum_{l=1}^{L} \frac{\phi K 2^l n}{\epsilon^2 2^{l-1}}\right) = O(n \log^3 n/\epsilon^2).$$

The high probability bound now follows by standard concentration inequalities. ■
Finally, we have the following theorem. 

Theorem 4.4 For any $\epsilon > 0$ and $d > 0$, there exists a one-pass algorithm that, given the edges of an
undirected graph $G$ streamed in adversarial order, produces an $\epsilon$-sparsifier $G'$ with $O(n \log^3 n/\epsilon^2)$ edges with
probability at least $1 - n^{-d}$. The algorithm takes $O(\log\log n)$ amortized time per edge and uses $O(\log^2 n)$
space per node.

Proof: Lemma 4.1 and Lemma 4.3 together establish that $G'$ is an $\epsilon$-sparsifier with $O(n \log^3 n/\epsilon^2)$ edges.
It remains to prove the stated runtime bounds.

Note that when an edge $e_t = (u_t, v_t)$ is processed in step 3 of Algorithm 3, it is not necessary to add
$e_t$ to any data structure $D_J$ in which $u_t$ and $v_t$ are already connected. Also, since $D_J$ is a refinement of
$D_{J'}$ whenever $J' \le J$, for every edge $e_t$ there exists $J^*$ such that $u_t$ and $v_t$ are connected in $D_J$ for any
$J \le J^*$ and not connected for any $J > J^*$. The value of $J^*$ can be found in $O(\log\log n)$ time by binary
search. Now we need to keep adding $e_t$ to $D_J$, for each $J > J^*$ such that $U'_{l,k,e_t} = 1$. However, we have that
$\mathbf{E}[\sum_{J > J^*} U'_{l,k,e_t}] = O(1)$. Amortizing over all edges, we get $O(1)$ per edge using standard concentration
inequalities. ■
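
The binary search used in this proof relies only on the fact that connectivity is monotone along the chain $D_0, D_1, \ldots, D_{LK}$. A minimal Python sketch follows; the function name `last_connected` and the predicate interface are illustrative assumptions. Since the chain has $LK + 1 = O(\log^2 n)$ entries, the search makes $O(\log\log n)$ connectivity queries.

```python
def last_connected(connected, lo, hi):
    """Largest J in [lo, hi] with connected(J) true.

    Assumes connected(lo) holds and that `connected` is monotone: true on a
    prefix of the range and false afterwards (D_J refines D_{J'} for J' <= J).
    """
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the range always shrinks
        if connected(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


# Toy chain with LK = 12 in which the endpoints stay connected up to J* = 5.
print(last_connected(lambda J: J <= 5, 0, 12))  # 5
```

In an implementation of Algorithm 3, `connected(J)` would query $D_J$ for the endpoints of the arriving edge, with $J = 0$ standing for the all-connected structure $D_0$.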

Corollary 4.5 For any $\epsilon > 0$ and $d > 0$, there exists a one-pass algorithm that, given the edges of an
undirected graph $G$ streamed in adversarial order, produces an $\epsilon$-sparsifier $G'$ with $O(n \log n/\epsilon^2)$ edges with
probability at least $1 - n^{-d}$. The algorithm takes amortized $O(\log\log n + (n/m) \log^5 n)$ time per edge and
uses $O(\log^3 n)$ space per node.

Proof: One can obtain a sparsification $G'$ with $O(n \log^3 n/\epsilon^2)$ edges by running Algorithm 3 on the input
graph $G$, and then run the Benczur-Karger algorithm on $G'$ without violating the restrictions of the
semi-streaming model. Note that even though $G'$ is a weighted graph, this will have overhead $O(\log^2 n)$ per edge
of $G'$ since the weights are polynomial. Since $G'$ has $O(n \log^3 n)$ edges, the amortized work per edge of
$G$ is $O(\log\log n + (n/m) \log^5 n)$. The Benczur-Karger algorithm can be implemented using space
proportional to the size of the graph, which yields $O(\log^3 n)$ space per node. ■

Remark 4.6 The algorithm above improves upon the runtime of the Benczur-Karger sparsification scheme
when $m = \omega(n \log^3 n)$.



Sparse $k$-connectivity Certificates: Our analysis of the performance of refinement sampling is along broadly
similar lines to the analysis of the strength estimation routine in [BK96]. To make this analogy more precise,
we note that refinement sampling as used in Algorithm 3 in fact produces a sparse connectivity certificate of
$G$, similarly to the algorithm of Nagamochi-Ibaraki [NI92], although with slightly weaker guarantees on size.

A $k$-connectivity certificate, or simply a $k$-certificate, for an $n$-vertex graph $G$ is a subgraph $H$ of $G$
such that $H$ contains all edges crossing cuts of size $k$ or less in $G$. Such a certificate always exists with $O(kn)$
edges, and moreover, there are graphs where $\Omega(kn)$ edges are necessary. The algorithm of [NI92] depends on
random access to edges of $G$ to produce a $k$-certificate with $O(kn)$ edges in $O(m)$ time. We now show that
refinement sampling gives a one-pass algorithm to produce a $k$-certificate with $O(kn \log^2 n)$ edges in time
$O(m \log\log n + n \log n)$. The result is summarized in the following corollary:

Corollary 4.7 Whp for each $l \ge 1$ the set $X(D_{l,K})$ is a $2^l$-certificate of $G$ with $O(\log^2 n) 2^l n$ edges.

Proof: Whp $X(D_{l,K})$ contains all $2^l$-weak edges, in particular those that cross cuts of size at most $2^l$. The
bound on the size follows by Lemma 4.2. ■

Remark 4.8 Algorithms 1-3 can be easily extended to graphs with polynomially bounded integer weights
on edges. If we denote by $W$ the largest edge weight, then it is sufficient to set the number of levels $L$ to
$\log(2nW)$ instead of $\log(2n)$ and the number of passes to $\log_{4/3} nW$ instead of $\log_{4/3} n$. A weighted edge is
then viewed as several parallel edges, and sampling can be performed efficiently for such edges by sampling
directly from the corresponding binomial distribution.

5 A Linear-time Algorithm for $O(n \log n/\epsilon^2)$-size Sparsifiers

We now present an algorithm for computing an $\epsilon$-sparsification with $O(n \log n/\epsilon^2)$ edges in
$O(m \log 4 + n^{1+\delta})$ expected time for any $\delta > 0$. Thus, the algorithm runs in linear time whenever $m = \Omega(n^{1+\Omega(1)})$.
We note that no (randomized) algorithm can output an $\epsilon$-sparsification in sub-linear time even if there is no
restriction on the size of the sparsifier. This is easily seen by considering the family of graphs formed by the
disjoint union of two $n$-vertex graphs $G_1$ and $G_2$ with $m$ edges each, and a single edge $e$ connecting the two
graphs. The cut that separates $G_1$ from $G_2$ has a single edge $e$, and hence any $\epsilon$-sparsifier must include $e$. On
the other hand, it is easy to see that $\Omega(m)$ probes are needed in expectation to discover the edge $e$.

Our algorithm can in fact be viewed as a two-pass streaming algorithm, and we present it as such below.
As before, let $G = (V, E)$ be an undirected unweighted graph. We will use Algorithm 3 as a building block
of our construction. We now describe each of the passes.

First pass: Sample every edge of $G$ uniformly at random with probability $p = 4/\log n$. Denote the resulting
graph by $G' = (V, E')$. Give the stream of sampled edges to Algorithm 3 as the input stream, and
save the state of the connectivity data structures $D_{l,K}$ for all $1 \le l \le L$ at the end of execution. For
$1 \le l \le L$, let $D'_l$ denote these connectivity data structures (we will also refer to $D'_l$ as partitions in
what follows).

Note that the first pass takes $O(m)$ expected time since Algorithm 3 has an overhead of $O(\log\log n)$ time
per edge and the expected size of $E'$ is $4|E|/\log n$.
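
The sampling step of the first pass can be sketched as follows. This is only an illustration under stated assumptions: the name `first_pass_sample` is hypothetical, the logarithm is taken base 2, the probability is capped at 1 for very small n, and Algorithm 3 itself (the consumer of the sampled stream) is not reproduced here.

```python
import math
import random


def first_pass_sample(edge_stream, n, rng):
    """Yield each edge of the stream independently with probability
    p = 4 / log2(n) (capped at 1), in stream order."""
    p = min(1.0, 4.0 / math.log2(n))
    for edge in edge_stream:
        if rng.random() < p:
            yield edge   # in the paper, this edge would be fed to Algorithm 3
```

Because only about a $4/\log n$ fraction of the edges is forwarded, an $O(\log\log n)$ per-edge cost downstream adds up to $o(m)$ in expectation, while the sampling itself costs $O(1)$ per edge.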

Recall that the partitions $D'_l$ are used in Algorithm 3 to estimate the strength of edges $e \in E'$. We now show
that these partitions can also be used to estimate the strength of edges in $E$. The following lemma establishes a
relationship between the edge strengths in $G'$ and $G$. For every edge $e \in E$, let $s'_e$ denote the strength of edge
$e$ in the graph $G'_e = (V, E' \cup \{e\})$.

Lemma 5.1 Whp $s'_e \le s_e \le 2s'_e \log n + \rho \log n$ for all $e \in E$, where $\rho = 16(d + 2) \ln n$ is the oversampling
parameter in Karger sampling.
Proof: The first inequality is trivially true since $G'_e$ is a subgraph of $G$. For the second one, let us first
consider any edge $e \in E$ with $s_e > \rho \log n$. Let $C$ be the $s_e$-strong component in $G$ that contains the edge
$e$. By Karger's theorem, whp the capacity of any cut defined by a partition of vertices in $C$ decreases by a
factor of at most $2\log n$ after sampling edges of $G$ with probability $p = 4/\log n = \rho/((1/2)^2 \rho \log n)$, i.e.
in going from $G$ to $G'$. So any cut in $C$, restricted to edges in $E'$, has size at least $s_e/(2\log n)$, implying that
$s'_e \ge s_e/(2\log n)$. Finally, for any edge $e$ with $s_e \le \rho \log n$, $s'_e$ is at least 1, and the inequality thus follows. ■
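
The choice of sampling probability in the proof can be checked directly. The following worked equation writes $\rho$ for the oversampling parameter and assumes Karger's theorem in its usual form (sampling a graph of minimum cut $c$ with probability at least $\rho/(\epsilon^2 c)$ keeps every cut within $1 \pm \epsilon$ of its expectation whp), applied with accuracy $\epsilon = 1/2$ and $c = \rho \log n$:

```latex
p \;=\; \frac{\rho}{\epsilon^{2} c}
  \;=\; \frac{\rho}{(1/2)^{2}\,\rho \log n}
  \;=\; \frac{4}{\log n},
\qquad
(1-\epsilon)\,p \;=\; \frac{2}{\log n} \;\ge\; \frac{1}{2\log n}.
```

Under that form of the theorem, every cut of a component with strength above $\rho \log n$ keeps at least a $1/(2\log n)$ fraction of its edges, which is the factor used in the lemma.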

We now discuss the second pass over the data. Recall that in order to estimate the strength $s'_e$ of an edge
$e \in E'$, Algorithm 3 finds the minimum $L(e)$ such that the endpoints of $e$ are not connected in $D'_{L(e)}$ by doing
a binary search over the range $[1..L]$. For an edge $e \in E$ we estimate its strength in $G'_e$ by doing binary
search as before, but stopping the binary search as soon as the size of the interval is smaller than $\delta L$, thus
taking $O(\log\frac{1}{\delta})$ time per edge and obtaining an estimate that is away from the true value by a factor of at
most $n^{\delta}$. Let $s''_e$ denote this estimate, that is, $s'_e n^{-\delta} \le s''_e \le s'_e n^{\delta}$. Now sampling every edge with probability

$p_e = \min\left\{\frac{\rho}{\epsilon^2 s''_e n^{-\delta}}, 1\right\}$ and giving each sampled edge weight $1/p_e$ yields an $\epsilon$-sparsification $G'' = (V, E'')$ of $G$
whp. Moreover, we have that w.h.p. $|E''| = O(n^{1+\delta})$. Finally, we provide the graph $G''$ as input to Algorithm
3 followed by applying Benczur-Karger sampling as outlined in Corollary 4.5, obtaining a sparsifier of size
$O(n \log n/\epsilon^2)$. We now summarize the second pass.

Second pass: For each edge $e$ of the input graph $G$:

• Perform $O(\log\frac{1}{\delta})$ steps of binary search to calculate $s''_e$.

• Sample edge $e$ with probability $p_e = \min\left\{\frac{\rho}{\epsilon^2 s''_e n^{-\delta}}, 1\right\}$.

• If $e$ is sampled, assign it a weight of $1/p_e$, and pass it as an input to a fresh invocation of Algorithm 3,
followed by Benczur-Karger sampling as outlined in Corollary 4.5, giving the final
sparsification.
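
The truncated binary search used in the second pass can be sketched as below. This is a hedged illustration rather than the paper's implementation: `truncated_level_search` and the oracle `not_connected` are hypothetical names, and the sketch assumes that the oracle is monotone in the level (False below the threshold level, True from it on), which is how the saved partitions are used in the text.

```python
def truncated_level_search(not_connected, num_levels, delta):
    """Narrow down the smallest level l in [1, num_levels] for which
    not_connected(l) is True, assuming not_connected is monotone.

    The search stops as soon as the candidate interval is shorter than
    delta * num_levels. Returns (lo, hi, probes) with the threshold level
    guaranteed to lie in [lo, hi]; hi == num_levels + 1 stands for
    "connected at every level".
    """
    lo, hi = 1, num_levels + 1
    stop_width = max(1.0, delta * num_levels)
    probes = 0
    while hi - lo >= stop_width:
        mid = (lo + hi) // 2          # always in [lo, hi - 1], so <= num_levels
        if not_connected(mid):
            hi = mid                  # threshold is at mid or below
        else:
            lo = mid + 1              # threshold is strictly above mid
        probes += 1
    return lo, hi, probes
```

Each probe roughly halves the interval, so about $\log_2(1/\delta)$ probes suffice; any level in the returned interval can then serve as the basis for the estimate $s''_e$.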

Note that the total time taken in the second pass is $O(m \log\frac{1}{\delta}) + O(n^{1+\delta})$. We have proved the following:

Theorem 5.2 For any $\epsilon > 0$ and $\delta > 0$, there exists a two-pass algorithm that produces an $\epsilon$-sparsifier in
time $O(m \log\frac{1}{\delta}) + O(n^{1+\delta})$. Thus the algorithm runs in linear time when $m = \Omega(n^{1+\delta})$ and $\delta$ is constant.
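
The two $\delta$-dependent quantities in this section can be traced back to the truncated search with a short calculation. It assumes $L = \log(2n)$ levels, as in the unweighted case of Remark 4.8, and that strength estimates at consecutive levels differ by a factor of 2; the latter is an assumption about Algorithm 3, which is not restated here.

```latex
\frac{\text{largest candidate estimate}}{\text{smallest candidate estimate}}
  \;\le\; 2^{\delta L} \;=\; 2^{\delta \log(2n)} \;=\; (2n)^{\delta} \;=\; O(n^{\delta}),
\qquad
\#\text{probes} \;=\; O\!\left(\log\frac{L}{\delta L}\right) \;=\; O\!\left(\log\frac{1}{\delta}\right).
```

So stopping once the interval has fewer than $\delta L$ levels costs $O(\log\frac{1}{\delta})$ probes per edge and loses a factor of about $n^{\delta}$ in the strength estimate.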
Figure 2: Scheme of refinement relations for Algorithm 2.

A Proof of Lemma 4.2



We denote the edges of $G$ in their order in the stream by $E = (e_1, \ldots, e_m)$. In what follows we shall treat
edge sets as ordered sets, and for any $E_1 \subseteq E$ write $E \setminus E_1$ to denote the result of removing edges of $E_1$ from
$E$ while preserving the order of the remaining edges. For a stream of edges $E$ we shall write $E_t$ to denote the
set of the first $t$ edges in the stream.

For a $\kappa$-connected component $C$ of a graph $G$ we will write $|C|$ to denote the number of vertices in $C$.
Also, we will denote the result of sampling the edges of $C$ uniformly at random with probability $p$ by $C'$. The
following simple lemma will be useful in our analysis:

Lemma A.1 Let $C$ be a $\kappa$-connected component of $G$ for some positive integer $\kappa$. Denote the graph obtained
by sampling edges of $C$ with probability $p \ge \lambda/\kappa$ by $C'$. Then the number of connected components in $C'$ is
at most $\gamma|C|$ with probability at least $1 - e^{-\eta|C|}$, where $\gamma = (7/8 + e^{-\lambda/2}/8)$ and $\eta = 1 - e^{-\lambda/2}$.

Proof: Choose $A, B \subseteq V(C)$ so that $A \cup B = V(C)$, $A \cap B = \emptyset$, $|A| \ge |V(C)|/2$ and for every $v \in A$
at least half of its edges that go to vertices in $C$ go to $B$. Note that such a partition always exists: starting
from any arbitrary partition of vertices of $C$, we can repeatedly move a vertex from one side to the other if
it increases the number of edges going across the partition, and upon termination, the larger side corresponds
to the set $A$. Denote by $Y$ the number of vertices of $A$ that belong to components of size at least 2. Note
that $Y$ can be expressed as a sum of $|A|$ independent 0/1 Bernoulli random variables. Let $\mu := E[Y]$; we
have that $\mu \ge |A|(1 - (1 - \lambda/\kappa)^{\kappa/2}) \ge |A|(1 - e^{-\lambda/2})$. We get by the Chernoff bound that
$\Pr[Y < |A|(1 - e^{-\lambda/2})/2] < e^{-2\mu} \le e^{-|C|(1 - e^{-\lambda/2})} = e^{-\eta|C|}$. Hence, at least a $(1 - e^{-\lambda/2})/4$ fraction of the
vertices of $C$ are in components of size at least 2. Hence, the number of connected components is at most a
$1 - (1 - e^{-\lambda/2})/8 = 7/8 + e^{-\lambda/2}/8 = \gamma$ fraction of the number of vertices of $C$. ■
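
The local-search argument for the existence of the partition $(A, B)$ can be made concrete with a small sketch. The function name `balanced_local_cut` and the adjacency-set representation are assumptions for illustration, and the sketch handles simple graphs only (no parallel edges).

```python
def balanced_local_cut(adj):
    """adj maps each vertex to the set of its neighbours (simple undirected graph).

    Repeatedly moves a vertex to the other side whenever that strictly
    increases the number of crossing edges. Returns (A, B) where A is the
    larger side; on termination every vertex has at least half of its
    neighbours on the opposite side.
    """
    side = {v: 0 for v in adj}
    improved = True
    while improved:
        improved = False
        for v in adj:
            same = sum(1 for u in adj[v] if side[u] == side[v])
            cross = len(adj[v]) - same
            if same > cross:          # moving v gains (same - cross) >= 1 cut edges
                side[v] = 1 - side[v]
                improved = True
    part0 = {v for v in adj if side[v] == 0}
    part1 = set(adj) - part0
    return (part0, part1) if len(part0) >= len(part1) else (part1, part0)
```

The loop terminates because each move strictly increases the cut size, which is bounded by the number of edges; this is the same potential argument the proof relies on.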






Proof of Lemma 4.2: The proof is by induction on $J$. We prove that w.h.p. for every $J = (l, k)$ one has

$|E \setminus X^J| \le \sum_{1 \le J' = (l',k') \le J-1} c_1 2^{l'} n$ for a constant $c_1 > 0$.

Base: J = 1 Since everything is connected in D by definition, the claim holds. 

Inductive step: $J \to J + 1$ The outline of the proof is as follows. For every $J = (l, k)$ we consider the
edges of the stream that the algorithm tries to add to $D_J$, identify a sequence of $2^l$-strongly connected
components $C_0, C_1, \ldots$ in the partially received graph, and use Lemma A.1 to show that the number
of connected components decreases fast because only a small fraction of vertices in the sampled
$2^l$-strongly connected components are isolated. We thus show that, informally, it will take $O(2^l n)$ edges
to make the connectivity data structure $D_J$ in Algorithm 3 connected. The connected components $C_s$
are defined by induction on $s$. The vertices of $C_s$ are elements of a partition $P_s$ of the vertex set $V$ of
the graph $G$. We shall use an auxiliary sequence of graphs which we denote by $H^t_s$.

Let $P_0$ be the partition consisting of isolated vertices of $V$. We treat the base case $s = 0$ separately
to simplify exposition. We use the definition of $\gamma$ and $\eta$ from Lemma A.1 with $\lambda = 1$ since we are
considering $2^l$-connected components when $J = (l, k)$.

Base case: $s = 0$. Set $H^t_0 = (P_0, \{e_1, \ldots, e_t\})$, i.e. $H^t_0$ is the partially received graph up to time $t$. Let
$t^*_0$ be the first value of $t$ such that $s_{H^t_0}(e_t) \ge 2^l$. This means that $e_{t^*_0}$ belongs to a $2^l$-strongly
connected component in $H^{t^*_0}_0$. Note that this component does not contain any $(2^l + 1)$-strongly
connected components. Denote this component by $C_0$ (note that the number of edges in $C_0$ is at
most $2^l|C_0|$ by Lemma 2.2). Denote the random variables that correspond to sampling edges of $C_0$
by $R_0$. Let $X_0$ be an indicator variable that equals 1 if the number of connected components in
$C'_0$ is at most $\gamma|C_0|$ and 0 otherwise. By Lemma A.1 we have that $\Pr[X_0 = 1] \ge 1 - e^{-\eta|C_0|}$.
For a partition $P$ denote $\mathrm{diag}(P) = \{(u, u) : u \in P\}$. Define $P_1$ by merging partitions of
$P_0$ that belong to connected components in $C'_0$ if $X_0 = 1$ and as equal to $P_0$ otherwise. Let
$E^1 = E \setminus (E(C_0) \cup \mathrm{diag}(P_1))$, i.e. we remove edges of $C_0$ and also edges that connect vertices
that belong to the same partition in $P_1$. Note that we can safely remove these edges since their
endpoints are connected in $D_J$ when they arrive. Define $H^t_1 = (P_1, E^1_t)$, i.e. $H^t_1$ is the partially
received graph on the modified stream of edges.

Inductive step: $s \to s + 1$. As in the base case, let $t^*_s$ be the first value of $t$ such that $s_{H^t_s}(e_t) \ge 2^l$.
This means that $e_{t^*_s}$ belongs to a $2^l$-connected component in $H^{t^*_s}_s$. Denote this component by
$C_s$ (note that the number of edges in $C_s$ is at most $2^l|C_s|$ by Lemma 2.2). Denote the random
variables that correspond to sampling edges of $C_s$ by $R_s$. Let $X_s$ be an indicator variable that
equals 1 if the number of connected components in $C'_s$ is at most $\gamma|C_s|$ and 0 otherwise. By
Lemma A.1 we have that $\Pr[X_s = 1] \ge 1 - e^{-\eta|C_s|}$. Define $P_{s+1}$ by merging together vertices
that belong to connected components in $C'_s$. Let $E^{s+1} = E^s \setminus (E(C_s) \cup \mathrm{diag}(P_s))$. Denote
$H^t_s = (P_s, E^s_t)$.

It is important to note that at each step $s$ we only flip coins $R_s$ that correspond to edges in $E(C_s)$,
and delete only those edges from $E^s$. While there may be edges going across partitions $P_s$ for which
we do not perform a coin flip, their number is bounded by $O(2^l n)$ since these edges do not contain a
$2^l$-connected component.

Note that for any $s > 0$ the number of connected components in $P_s$ is at most

$$n - \sum_{j=1}^{s} (1 - \gamma)|C_j|X_j.$$






We now show that it is very unlikely that $\sum_{j=1}^{s} |C_j|X_j$ is more than a constant factor smaller than
$\sum_{j=1}^{s} |C_j|$, thus showing that the number of connected components cannot be more than 1 when
$\sum_{j=1}^{s} |C_j| \ge \frac{cn}{1-\gamma}$ for an appropriate constant $c > 0$.

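For any constant $d > 0$ define $I^+ = \{i \ge 0 : |C_i| > ((d+2)/\eta) \log n\}$ and $I^- = \{i \ge 0 : |C_i| \le ((d+2)/\eta) \log n\}$.
Also define $Z^+_i = \sum_{0 \le j \le i, j \in I^+} |C_j|X_j$ and $Z^-_i = \sum_{0 \le j \le i, j \in I^-} \left(X_j|C_j| - |C_j|(1 - e^{-\eta|C_j|})\right)$.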

First note that one has $\Pr[X_j = 1] \ge 1 - n^{-d-2}$ for any $j \in I^+$ by Lemma A.1. Hence, it follows by
taking the union bound that for any $i \le n^2$ one has $\Pr\left[Z^+_i = \sum_{j \in I^+, j \le i} |C_j|\right] \ge 1 - n^{-d}$.



We now consider $Z^-_i$. Note that the $Z^-_i$'s define a martingale sequence with respect to $R_{i-1}, \ldots, R_0$:
$E[Z^-_i \mid R_{i-1}, \ldots, R_0] = Z^-_{i-1}$. Also, $|Z^-_i - Z^-_{i-1}| \le ((d + 2)/\eta) \log n$ for all $i$. Hence, by Azuma's
inequality (see, e.g. [AS08]) one has

$$\Pr[Z^-_i < -t] < \exp\left(-\frac{t^2}{2i(((d + 2)/\eta) \log n)^2}\right).$$



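Now consider the smallest value $\tau$ such that

$$\sum_{j \le \tau} |C_j| = \sum_{j \le \tau, j \in I^+} |C_j| + \sum_{j \le \tau, j \in I^-} |C_j| = S^+ + S^- \ge \frac{4n}{(1 - e^{-2\eta})(1 - \gamma)}.$$

Note that $\tau \le 2n/((1 - e^{-2\eta})(1 - \gamma))$ since $|C_i| \ge 2$. If $S^+ \ge 2n/(1 - \gamma)$, then we have that
$Z^+_\tau = S^+ \ge 2n/(1 - \gamma)$ with probability at least $1 - n^{-d}$. Thus,

$$n - \sum_{j=1}^{\tau} (1 - \gamma)|C_j|X_j \le n - (1 - \gamma)Z^+_\tau < 0.$$

Otherwise $S^- \ge \frac{2n}{(1 - e^{-2\eta})(1 - \gamma)}$ and by Azuma's inequality we have

$$\Pr[Z^-_\tau < -n] < \exp\left(-\frac{n^2}{2\tau(((d + 2)/\eta) \log n)^2}\right) \le \exp\left(-\frac{(1 - e^{-2\eta})(1 - \gamma)\,n}{4(((d + 2)/\eta) \log n)^2}\right) < n^{-d}.$$

Since $|C_i| \ge 2$, we have $|C_i|(1 - e^{-\eta|C_i|}) \ge |C_i|(1 - e^{-2\eta})$ and thus we get

$$\begin{aligned}
n - \sum_{i=1}^{\tau} (1 - \gamma)|C_i|X_i
 &\le n - (1 - \gamma)\left[\sum_{1 \le j \le \tau, j \in I^-} |C_j|(1 - e^{-\eta|C_j|}) + Z^-_\tau\right] \\
 &\le n - (1 - \gamma)\left[(1 - e^{-2\eta})\sum_{1 \le j \le \tau, j \in I^-} |C_j| - n\right] \\
 &\le n - (1 - \gamma)\left[\frac{2n}{1 - \gamma} - n\right] = -\gamma n < 0.
\end{aligned}$$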

We have shown that there exists a constant $c' > 0$ such that with probability at least $1 - n^{-d}$ after $c' 2^l n$
edges are sampled by the algorithm at level $J$ all subsequent edges will have their endpoints connected
in $D_J$. Note that we never flipped coins for those edges that did not contain a $2^l$-connected component.
Setting $c_1 = c' + 1$, we have that w.h.p. $|E \setminus X^J| \le c_1 2^l n + |E \setminus X^{J-1}|$. By the inductive hypothesis
we have that $|E \setminus X^{J-1}| \le \sum_{1 \le J' = (l',k') \le J-2} c_1 2^{l'} n$, which together with the previous estimate gives
us the desired result.

It now follows that $|E \setminus X^J| \le \sum_{1 \le J' = (l',k') \le J-1} c_1 2^{l'} n = O(K 2^l n)$ w.h.p., finishing the proof of the
lemma.






